
refactor(cardano): shard, reorder, and merge the EWRAP boundary pipeline#978

Open
scarmuega wants to merge 15 commits into main from feat/shard-ewrap-work-units

Conversation


@scarmuega scarmuega commented Apr 25, 2026

Summary

Restructures the EWRAP epoch-boundary pipeline across six commits:

  • Shard EWRAP into prepare/shard/finalize work units (948d472e, pre-existing) — initial split of the monolithic EWRAP into three phases.
  • Rename EwrapPrepare → Ewrap, seed EpochState.end in ESTART (9627c3ae) — drop the misleading "Prepare" name; ESTART's EpochTransition now opens the end slot.
  • Rename EwrapShard → AccountShard (ae260792) — name the per-account work unit by what it processes, not its position in the pipeline.
  • Reorder: AccountShard runs before Ewrap (eaf8e08c) — settle per-account effects first, then run the global Ewrap visitors. EpochEndInit becomes a PATCH delta; EpochEndAccumulate accepts ewrap_progress = None as the natural shard-0 starting state.
  • Merge EwrapFinalize into Ewrap; drop EpochEndInit (a1e0d5cf) — collapse the redundant finalize phase into Ewrap; the wrap-up visitor now assembles the full EndStats (prepare-time fields + shard accumulators) and emits a single EpochWrapUp. One fewer state-machine state, one fewer delta type, one fewer commit cycle. Boundary close is now atomic in a single state-writer commit.
  • Seed EpochState.end in Genesis (5122e075) — restores the invariant that end = Some(...) whenever an AccountShard runs (the first boundary's shards otherwise hit a fresh genesis state with end = None).

Final pipeline: Estart → Roll … → Rupd → Roll … → AccountShard ×N → Ewrap (where Ewrap now both runs the global visitors and closes the boundary).
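The boundary leg of this sequence can be pictured as a tiny state machine. A minimal sketch under assumed names (`WorkUnit`, `next_boundary_unit` are hypothetical, not the crate's actual `WorkBuffer` API):

```rust
// Illustrative only: the real pipeline lives in WorkBuffer::pop_work and
// persists its cursor durably; the names here are hypothetical.
#[derive(Debug, PartialEq, Clone, Copy)]
#[allow(dead_code)]
enum WorkUnit {
    Estart,
    Roll,
    Rupd,
    AShard { shard_index: u32 },
    // Ewrap now both runs the global visitors and closes the boundary;
    // there is no separate finalize unit anymore.
    Ewrap,
}

// At the boundary: emit N account shards, then a single Ewrap.
fn next_boundary_unit(committed_shards: u32, total_shards: u32) -> WorkUnit {
    if committed_shards < total_shards {
        WorkUnit::AShard { shard_index: committed_shards }
    } else {
        WorkUnit::Ewrap
    }
}
```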

Test plan

  • cargo check -p dolos-cardano runs clean
  • cargo test -p dolos-cardano — all 98 unit tests pass (work-buffer state-machine, epoch delta proptests, etc.)
  • cargo build --tests — full workspace compiles
  • cargo test --test memory passes
  • cargo test --test epoch_pots --release — gold-standard end-to-end against DBSync ground truth (requires populated Mithril test instance)
  • Manual bootstrap from Mithril to verify the genesis seeding fix in 5122e075 resolved the original panic

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added an AShard work unit to run per-shard, per-account reward application.
    • Added account_shards configuration (defaults to 16) to control shard parallelism.
  • Improvements

    • Restructured epoch-boundary pipeline: rewards are applied per-shard and final wrap-up runs once per boundary for more predictable processing and crash recovery.
    • Reduced peak memory during key-range iteration.
  • Tests & Docs

    • Added memory-bounded iteration tests and updated pipeline documentation.

scarmuega and others added 6 commits April 24, 2026 17:40
Partitions the epoch-boundary EWRAP work unit — previously one monolithic
work unit that materialised O(active_accounts) in memory (rewards map,
deltas, logs, applied_rewards) — into three phase-specific work units that
each commit independently:

* EwrapPrepare: global classification (pools/dreps/proposals), MIRs,
  enactment + refund visitors for non-account entities, emits EpochEndInit
  seeding EpochState.end with the prepare-time globals and zeroed reward
  accumulators.
* EwrapShard(i): range-scoped (first-byte prefix bucket) load of pending
  rewards + accounts, runs rewards + drops visitors per account, emits
  EpochEndAccumulate with the shard's reward contribution.
* EwrapFinalize: reads the accumulated EpochState.end, emits EpochWrapUp
  (which transitions rolling/pparams snapshots and clears ewrap_progress).

Cross-shard handoff piggy-backs on EpochState rather than a new entity:
ewrap_progress: Option<u32> is the durable cursor and EpochState.end
accumulates across shards via the new deltas.

WorkBuffer gains EwrapShardingBoundary{shard_index, total_shards} and
EwrapFinaliseBoundary states; pop_work now takes ewrap_total_shards from
CardanoConfig (default 16). EpochEndAccumulate has an idempotency guard
keyed on ewrap_progress so shard re-execution after a crash is safe.

Detection-only crash recovery at initialize time logs a warning when
ewrap_progress is set; full block-rehydration resume is flagged as TODO.

Memory tests in tests/memory.rs verify both fjall and redb3 honour
range-scoped iter_entities with O(1) heap — the load-bearing property for
the shard design.
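The first-byte prefix bucketing described above can be sketched as follows; `shard_prefix_range` is a hypothetical helper, not the crate's actual partitioning code:

```rust
// Illustrative sketch: split the 256 possible leading key bytes into
// `total` contiguous buckets; shard `i` owns prefixes in [start, end).
// Every key falls in exactly one bucket, so shards never overlap.
fn shard_prefix_range(i: u32, total: u32) -> Result<(u32, u32), String> {
    if total == 0 || total > 256 {
        return Err(format!("invalid shard total: {total}"));
    }
    if i >= total {
        return Err(format!("shard index {i} out of range 0..{total}"));
    }
    let start = (256 * i) / total;
    let end = (256 * (i + 1)) / total;
    Ok((start, end))
}
```

With the default of 16 shards, each bucket covers 16 first-byte values; shard 0 would own prefixes 0x00..0x10.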

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n ESTART

Decouple two responsibilities that were tangled in EwrapPrepareWorkUnit:
the global epoch-boundary entity processing (now plain `Ewrap`) and the
structural opening of the `EpochState.end` slot (now done by ESTART's
`EpochTransition`). Ewrap's `EpochEndInit` delta keeps its overwrite
semantics; it now writes into a default-seeded slot rather than from
None.

Also adds `prev_end` / `prev_ewrap_progress` undo fields to
`EpochTransition` (serialized, like the other prev_* fields) so a
rollback after restart correctly restores them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The per-account leg of the epoch close was named after its position in the
EWRAP pipeline; AccountShard names what it actually does — apply rewards
and pool/drep delegator drops over a key-range slice of the account
namespace.

Also renames the related symbols (BoundaryWork::load_shard /
commit_shard → load_account_shard / commit_account_shard,
WorkBuffer::EwrapShardingBoundary → AccountShardingBoundary,
InternalWorkUnit::EwrapShard → AccountShard). The user-facing
`ewrap_total_shards` config field is intentionally preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…line

The epoch-boundary sequence is now AccountShard ×N → Ewrap → EwrapFinalize
(was Ewrap → AccountShard ×N → EwrapFinalize). Per-account work settles
first; the global Ewrap phase then patches the prepare-time fields onto an
EpochState.end that already has its reward accumulators populated.

State machine: WorkBuffer::pop_work transitions reordered, and
on_ewrap_boundary now takes ewrap_total_shards so the restart-at-boundary
entry can construct AccountShardingBoundary directly. The
total_shards == 0 defensive branch now skips to EwrapBoundary (global
phase) instead of EwrapFinaliseBoundary.

Delta semantics:
- EpochEndInit::apply is now a PATCH — writes only the prepare-time
  fields (pool counts, epoch_incentives, MIR amounts, proposal refunds)
  and leaves the accumulator fields alone. ewrap_progress is no longer
  touched by this delta. Dropped the unused prev_ewrap_progress field.
- EpochEndAccumulate::apply treats ewrap_progress = None as the natural
  starting state for shard 0 (unwrap_or(0) as the expected cursor).
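The cursor semantics described above can be sketched roughly as below; the types and fields are illustrative stand-ins, not the crate's real `EpochState` or delta definitions:

```rust
// Illustrative stand-ins for the real EpochState / EpochEndAccumulate.
struct EpochState {
    // Durable cursor: how many shards have committed this boundary.
    ewrap_progress: Option<u32>,
}

struct EpochEndAccumulate {
    shard_index: u32,
}

// Idempotent apply: None is the natural starting state for shard 0
// (unwrap_or(0) as the expected cursor); replaying an already-committed
// shard after a crash is a no-op.
fn apply(state: &mut EpochState, delta: &EpochEndAccumulate) {
    let expected = state.ewrap_progress.unwrap_or(0);
    if delta.shard_index != expected {
        return; // stale or out-of-order shard: skip
    }
    // ... fold the shard's reward contribution into EpochState.end ...
    state.ewrap_progress = Some(expected + 1);
}
```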

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…delta

The boundary close is now a single Ewrap work unit: it runs the global
visitors AND emits EpochWrapUp carrying the assembled final EndStats
(prepare-time fields combined with the AccountShard-populated accumulator
fields). The wrap-up visitor now constructs the final stats locally
instead of routing them through a separate EpochEndInit delta.

Side-benefits: one fewer state-machine state, one fewer delta type, one
fewer commit cycle. Atomicity also improves — the boundary close is now
a single state-writer commit, so a crash between Ewrap and EwrapFinalize
is no longer possible.

Test fixture in tests/epoch_pots/main.rs restructured to match the
post-reorder pipeline: accumulator reset gates on AccountShard
shard_index == 0; rewards CSV is dumped on the Ewrap arm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the boundary-pipeline reorder (AccountShard runs before Ewrap),
the first epoch's AccountShard hits `EpochEndAccumulate::apply` with
`entity.end == None` because Genesis bootstrapped the EpochState before
ESTART's `EpochTransition` had a chance to seed the slot. Seed
`end = Some(EndStats::default())` directly in Genesis to match the
invariant ESTART maintains for every subsequent epoch.
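A minimal sketch of the restored invariant, with placeholder types (the real `EndStats` carries the prepare-time fields and reward accumulators; names are illustrative):

```rust
// Placeholder types; the real definitions live in the cardano crate.
#[derive(Default)]
struct EndStats; // fields elided

struct EpochState {
    end: Option<EndStats>,
    ashard_progress: Option<u32>,
}

// Genesis now seeds `end` the same way ESTART's EpochTransition does for
// every later epoch, so the first boundary's shards never see end == None.
fn bootstrap_genesis_epoch() -> EpochState {
    EpochState {
        end: Some(EndStats::default()),
        ashard_progress: None,
    }
}
```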

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 25, 2026

📝 Walkthrough

Walkthrough

Adds an AShard work-unit phase that shards per-account epoch-boundary reward application by entity-key range, snapshots per-shard accumulators into epoch state, and refactors Ewrap to handle global MIR/refund/finalization after all shards commit.

Changes

Cohort / File(s) Summary
AShard Core
crates/cardano/src/ashard/mod.rs, crates/cardano/src/ashard/work_unit.rs, crates/cardano/src/ashard/loading.rs, crates/cardano/src/ashard/commit.rs
New AShard work-unit: load_ashard applies per-account pending rewards within a key-range, emits per-shard EpochEndAccumulate, and commit_ashard writes drained logs, updates shard accumulators, and commits state/archive.
Shard Helpers & Rewards
crates/cardano/src/ashard/shard.rs, crates/cardano/src/ashard/rewards.rs
Adds shard range partitioning (first-byte prefix, validation) and adapts rewards visitor to use crate::ewrap types and shard-scoped reward retention.
Ewrap Refactor
crates/cardano/src/ewrap/loading.rs, crates/cardano/src/ewrap/commit.rs, crates/cardano/src/ewrap/work_unit.rs
Replaces load/compute/commit flow with load_ewrap/compute_ewrap_deltas/commit_ewrap; streaming helper gains optional range filter; Ewrap now handles MIR and boundary closure (no per-account reward application).
Ewrap State & Wrap-up
crates/cardano/src/ewrap/mod.rs, crates/cardano/src/ewrap/wrapup.rs
Exposes ending_state (pub(crate)), adds shard accumulator fields to BoundaryWork, and changes EndStats derivation to use epoch-state accumulators saved into EpochState.end.
Epoch State & Deltas
crates/cardano/src/model/epochs.rs, crates/cardano/src/model/mod.rs
Adds ashard_progress to EpochState, AShardProgress snapshot type, new EpochEndAccumulate delta (apply/undo with idempotency guards), and wires delta into CardanoDelta enum and epoch transition logic.
Work Pipeline & Buffering
crates/cardano/src/work.rs, crates/cardano/src/lib.rs
Introduces AShardingBoundary and InternalWorkUnit::AShard; wires AShardWorkUnit into CardanoWorkUnit and work-materialization; receive_block/pop_work now accept account_shards and drive per-shard emission.
Config & Genesis
crates/core/src/config.rs, crates/cardano/src/genesis/mod.rs
Adds CardanoConfig.account_shards with default 16 and accessor; genesis bootstraps EpochState.end = Some(EndStats::default()) and sets initial ashard_progress = None.
Ewrap/Bdry Reorg & Visitors
crates/cardano/src/ewrap/mod.rs, crates/cardano/src/ewrap/rewards.rs
Moves per-account applied-rewards concept to AShard; removes rewards export from ewrap and adjusts visitor signatures/usages.
Docs & Tests
crates/cardano/work_units.md, crates/cardano/src/model/pending.rs, crates/cardano/src/rupd/work_unit.rs, skills/debug-epoch-mismatch/SKILL.md, tests/epoch_pots/main.rs, tests/memory.rs
Adds work-units doc; updates docstrings to reflect AShard/Ewrap responsibilities; test changes: accumulate applied_rewards across AShard units for CSV dump and add memory-boundedness tests for range iteration.
Misc: Error Type
crates/core/src/lib.rs
Adds ChainError::InvalidConfig(String) variant for config validation errors.

Sequence Diagram(s)

sequenceDiagram
    participant Boundary as WorkBuffer/Boundary
    participant AShard as AShardWorkUnit
    participant BWLoad as BoundaryWork::load_ashard
    participant BWCommit as BoundaryWork::commit_ashard
    participant Epoch as EpochState
    participant Ewrap as EwrapWorkUnit

    Boundary->>AShard: emit AShard(slot, shard_index)
    AShard->>BWLoad: load_ashard(state, genesis, shard_index, total_shards, range)
    BWLoad->>BWLoad: load pending rewards for range\niterate accounts, apply rewards+drops
    BWLoad->>Epoch: snapshot shard accumulators\nemit EpochEndAccumulate delta
    AShard->>BWCommit: commit_ashard(state, archive, range)
    BWCommit->>Epoch: write drained logs, update ashard_progress
    BWCommit->>Boundary: commit writers
    Boundary->>Ewrap: after all shards -> emit Ewrap
    Ewrap->>Ewrap: load_ewrap(), apply MIR/refunds, finalize EndStats
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related PRs

Poem

🐰 Hop, I sorted shards so neat,

Keys in ranges, memory sweet.
Rewards applied in bounded stride,
Epochs gather what each did provide.
AShard hops — then Ewrap's pride. 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: refactoring the EWRAP boundary pipeline by introducing sharding, reordering, and merging phases.
Docstring Coverage ✅ Passed Docstring coverage is 80.46% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.



scarmuega and others added 6 commits April 25, 2026 11:51
Pulls the AccountShard work unit out of `ewrap/` into a peer module
`ashard/` (matching the layout of `estart/`, `rupd/`, `roll/`,
`genesis/`). The shared `BoundaryWork` / `BoundaryVisitor` infrastructure
and the drops visitor (used by both phases) stay in `ewrap/`; `ashard/`
imports them.

Moves: `rewards.rs`, `shard.rs`, `AccountShardWorkUnit` (from
`work_unit.rs`), and the `load_*` / `commit_*` impl blocks. Visibility
on shared `BoundaryWork` helpers (`new_empty`, `load_pool_data`,
`load_drep_data`, `stream_and_apply_namespace`) widened from private to
`pub(crate)`. The `ending_state` field also widened to `pub(crate)` so
peer modules can mutate it (e.g. `wrapup.flush` already does this).

Method/identifier renames to match the new module path:
- `BoundaryWork::load_account_shard` → `load_ashard`
- `BoundaryWork::commit_account_shard` → `commit_ashard`
- `WorkUnit::name()` returns `"ashard"`

Type name `AccountShardWorkUnit` is preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mments

Sweeps the docstrings/comments touched in this PR for references to
phases, work units, and deltas that no longer exist after the rename /
reorder / merge / split sequence:

- Restore the in-place explanation for the "rewards before drops"
  HACK in `ashard/loading.rs` (the dangling "see comment on the
  pre-shard path" pointed to a comment that was deleted when the
  prepare phase was removed).
- Drop "prepare phase" / "finalize phase" wording from `BoundaryWork`
  field docstrings, `commit_ewrap` comments, and `loading.rs` section
  dividers — neither phase exists; there's only Ewrap (global + close)
  and AccountShard (per-account).
- Update the ESTART `EpochTransition` description in `work_units.md`
  so it reflects the post-merge data flow: AccountShards populate the
  accumulators directly, then Ewrap reads them back and emits
  `EpochWrapUp` with the final `EndStats` (no `EpochEndInit` patch step
  anymore).
- Rename `compute_prepare_deltas` → `compute_ewrap_deltas`. The
  "prepare" name was a leftover from the `EwrapPrepare` work unit; the
  method is now the only Ewrap-phase compute helper.
- Tighten `load_pending_rewards_range` docstring; flag that the `None`
  branch is currently unused.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The shard-related identifiers and comments were named after the legacy
EWRAP pipeline that bundled the global epoch-boundary work and the
per-account shards together. With AccountShard now a distinct work unit
in its own module, those names are misleading. Rename to use the
`ashard` prefix consistently with the module path:

- `CardanoConfig::ewrap_total_shards` → `ashard_total`
- `CardanoConfig::DEFAULT_EWRAP_TOTAL_SHARDS` → `DEFAULT_ASHARD_TOTAL`
- `EpochState::ewrap_progress` → `ashard_progress`
- `prev_ewrap_progress` → `prev_ashard_progress` on `EpochEndAccumulate`,
  `EpochWrapUp`, and `EpochTransition`
- `WorkBuffer::receive_block` / `on_ewrap_boundary` / `pop_work`
  parameter `ewrap_total_shards` → `ashard_total`
- Error messages in `ashard/shard.rs` updated to match.

Also fixes comment / doc misattributions where "EWRAP" was used for
work that's now in `AccountShard`:
- `PendingRewardState` / `DequeueReward` are consumed by `AccountShard`,
  not Ewrap.
- `PendingMirState` / `DequeueMir` are consumed by Ewrap (clarified).
- `AppliedReward` and the `applied_rewards` field are populated during
  AccountShard, not Ewrap.
- RUPD's docstring now says rewards are consumed by `AccountShard`.
- Crash-recovery wording in `lib.rs` says "mid-boundary" instead of
  "mid-EWRAP" since the cursor specifically tracks AccountShard
  progress.

BREAKING CONFIG CHANGE: existing `dolos.toml` files that explicitly set
`ewrap_total_shards` need to rename the key to `ashard_total`. Users
relying on the default (omitted) are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the type and variant names with the module path convention:

- struct `AccountShardWorkUnit` → `AShardWorkUnit`
- enum variant `CardanoWorkUnit::AccountShard` → `AShard`
- enum variant `InternalWorkUnit::AccountShard` → `AShard`
- WorkBuffer state `AccountShardingBoundary` → `AShardingBoundary`
- module re-export and all callers updated to match
- prose / docstrings / log messages also use `AShard` consistently

The module path is `crate::ashard`, so the type now reads as
`ashard::AShardWorkUnit`. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review feedback: the user-facing config name should be self-explanatory
in `dolos.toml`. Renames everywhere for consistency:

- `CardanoConfig::ashard_total` field → `account_shards`
- `CardanoConfig::ashard_total()` accessor → `account_shards()`
- `CardanoConfig::DEFAULT_ASHARD_TOTAL` → `DEFAULT_ACCOUNT_SHARDS`
- WorkBuffer parameters and error messages updated to match.

BREAKING CONFIG CHANGE: existing `dolos.toml` files that explicitly set
this option (under any prior name from this PR) need to use
`account_shards`. Users relying on the default are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guards against a config change to `account_shards` corrupting an
in-flight boundary. Previously, if dolos crashed mid-boundary and the
operator changed `account_shards` between crash and restart, the resume
would re-partition the account key space with the new count, mismatching
the cursor's already-committed shards.

Fix: snapshot the boundary's shard count into state at the first
`EpochEndAccumulate` apply. The persisted total is authoritative for the
duration of the in-flight boundary; the new config value only takes
effect on the next boundary.

Changes:
- New `AShardProgress { committed, total }` struct stored at
  `EpochState.ashard_progress: Option<AShardProgress>` (was
  `Option<u32>`).
- `EpochEndAccumulate` carries `total_shards`. Its apply validates the
  delta's `total_shards` matches any previously persisted total and
  surfaces an error if they diverge (would only happen if a work unit
  was constructed with a stale config view).
- `EpochWrapUp` and `EpochTransition` undo fields adapted to the new
  type.
- `AShardWorkUnit::load` / `commit_state` read the persisted total when
  present and fall back to `config.account_shards()` for fresh
  boundaries.
- `CardanoLogic` caches `effective_account_shards` (= persisted total
  if a boundary is in flight, else config). Refreshed at every
  `pop_work` call so `receive_block` (which has no state access) can
  use the up-to-date value when constructing
  `WorkBuffer::AShardingBoundary`.
- Crash-recovery wording updated to surface a clear warning when the
  persisted total disagrees with current config.
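The snapshot-and-validate rule can be sketched as follows; the names are simplified from the description above and the error type is a plain String for illustration:

```rust
// As described above: the boundary's shard count is snapshotted into
// state on the first accumulate and is authoritative until wrap-up.
struct AShardProgress {
    committed: u32,
    total: u32,
}

struct EpochState {
    ashard_progress: Option<AShardProgress>,
}

fn apply_accumulate(state: &mut EpochState, total_shards: u32) -> Result<(), String> {
    match state.ashard_progress {
        // First shard of the boundary: persist the total.
        None => {
            state.ashard_progress = Some(AShardProgress { committed: 1, total: total_shards });
            Ok(())
        }
        // Later shards must agree with the persisted total; a config
        // change mid-boundary surfaces as an error instead of silently
        // re-partitioning the key space under the in-flight cursor.
        Some(ref mut p) => {
            if p.total != total_shards {
                return Err(format!(
                    "shard total mismatch: persisted {} vs delta {}",
                    p.total, total_shards
                ));
            }
            p.committed += 1;
            Ok(())
        }
    }
}
```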

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@scarmuega scarmuega marked this pull request as ready for review April 26, 2026 13:46

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
crates/cardano/src/ashard/rewards.rs (1)

114-188: ⚠️ Potential issue | 🟡 Minor

Log message clarity: add shard context to per-shard telemetry.

The flush method (lines 114–188) now runs once per AShard, but the telemetry messages ("rewards remaining before drain", "SPENDABLE REWARDS LEFT UNPROCESSED", etc.) don't indicate shard scope. Per-shard loading correctly isolates each shard's RewardMap via load_pending_rewards_range(state, Some(range)), so the drain and error detection are functionally sound. However, for operational visibility, these log statements should include shard index and total_shards to clarify they reflect per-shard metrics, not boundary-wide state. This is especially helpful for debugging when multiple shards report metrics in parallel.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ashard/rewards.rs` around lines 114 - 188, The per-shard
logs in flush don't indicate which shard they belong to; update the flush
logging to include shard context (shard_index and total_shards). Add
shard_index: usize and total_shards: usize as fields on the AShard struct (or
otherwise make them available on self or ctx), ensure any AShard
construction/population passes these values, and then include %shard_index and
%total_shards in the tracing::debug! and tracing::error! invocations (and the
debug! at the end) around ctx.rewards/ drain_unspendable/
applied_reward_credentials so all per-shard telemetry shows shard scope.
crates/cardano/src/model/epochs.rs (1)

980-981: ⚠️ Potential issue | 🟡 Minor

Fix the unused_doc_comments rustdoc warning flagged by CI.

Pipeline reports rustdoc does not generate documentation for macro invocations on these lines. The /// comment immediately above the prop_compose! macro invocation is dropped on the floor by rustdoc and triggers the lint. Convert to a regular // comment (or move it inside the closure body if you want it preserved).

🛠️ Proposed fix
-    /// `EpochStatsUpdate::apply` calls `rolling.live_mut` which asserts `next` is None,
-    /// so we need a specialized generator that keeps `rolling.next` empty.
+    // `EpochStatsUpdate::apply` calls `rolling.live_mut` which asserts `next` is None,
+    // so we need a specialized generator that keeps `rolling.next` empty.
     prop_compose! {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/model/epochs.rs` around lines 980 - 981, The rustdoc
warning comes from using a /// doc comment immediately above the prop_compose!
macro (related to EpochStatsUpdate::apply and rolling.live_mut which asserts
rolling.next is None), so replace that leading /// comment with a non-doc
comment (//) or move the explanation into the generator body/closure so rustdoc
won't try to document the macro invocation; specifically update the comment
above the prop_compose! invocation that references rolling.live_mut and
rolling.next to use // (or relocate it inside the prop_compose! closure) to
silence the unused_doc_comments lint.
🧹 Nitpick comments (12)
crates/cardano/src/ashard/rewards.rs (1)

56-77: Stale EWRAP reference in comment.

The comment at lines 59–61 still says "registered at RUPD startStep time but deregistered before EWRAP". After this refactor the per-account registration check happens during AShard, not Ewrap. Suggest updating to "deregistered before AShard" (or "before the boundary's per-account leg") to keep the comment in lockstep with the renamed pipeline.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ashard/rewards.rs` around lines 56 - 77, Update the stale
comment referencing "EWRAP" to reflect the new pipeline stage: change the text
around the account deregistration explanation to say "deregistered before
AShard" (or "before the boundary's per-account leg") so it matches where the
per-account registration check now runs; the relevant area to edit is the
comment above the account.is_registered() branch in the function that calls
reward.total_value() and ctx.rewards.return_reward_to_treasury(total). Ensure
the updated comment keeps the existing clarification about pre-filtered accounts
and returned_rewards behavior.
crates/cardano/src/ashard/loading.rs (1)

19-57: Optional: drop the Option<Range<EntityKey>> wrapper.

The doc-comment already states load_ashard is the only caller and always passes Some(range). Making the parameter Range<EntityKey> directly removes a never-taken branch and tightens the contract. If a future "load all" caller appears, reintroducing the option is trivial.

♻️ Suggested refactor
-    fn load_pending_rewards_range<D: Domain>(
-        &mut self,
-        state: &D::State,
-        range: Option<Range<EntityKey>>,
-    ) -> Result<(), ChainError> {
-        let pending_iter = state
-            .iter_entities_typed::<PendingRewardState>(PendingRewardState::NS, range)?;
+    fn load_pending_rewards_range<D: Domain>(
+        &mut self,
+        state: &D::State,
+        range: Range<EntityKey>,
+    ) -> Result<(), ChainError> {
+        let pending_iter = state
+            .iter_entities_typed::<PendingRewardState>(PendingRewardState::NS, Some(range))?;

And drop the Some(range.clone()) at the call site (line 81) to pass range.clone() directly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ashard/loading.rs` around lines 19 - 57, The function
load_pending_rewards_range currently accepts Option<Range<EntityKey>> but the
only caller (load_ashard) always passes Some(range), so tighten the signature to
take Range<EntityKey> directly: change the parameter type in
load_pending_rewards_range and update callers (e.g., in load_ashard remove
Some(range.clone()) and pass range.clone() directly); ensure the body drops any
pattern-matching for None (no-op) and uses the range value when calling
state.iter_entities_typed::<PendingRewardState>(PendingRewardState::NS, range).
tests/epoch_pots/main.rs (1)

500-513: Optional: collapse the two if let Some(boundary) checks.

The shard-index==0 reset and the per-shard extend can share a single boundary binding, which trims a branch and makes the relationship between "capture ending epoch" and "extend rewards" obvious.

♻️ Suggested refactor
                 CardanoWorkUnit::AShard(shard) => {
-                    if shard.shard_index() == 0 {
-                        // First shard of this boundary — reset accumulator
-                        // and capture the ending epoch.
-                        accumulated_applied.clear();
-                        if let Some(boundary) = shard.boundary() {
-                            accumulated_ending_epoch =
-                                Some(boundary.ending_state().number);
-                        }
-                    }
-                    if let Some(boundary) = shard.boundary() {
+                    if let Some(boundary) = shard.boundary() {
+                        if shard.shard_index() == 0 {
+                            // First shard of this boundary — reset
+                            // accumulator and capture the ending epoch.
+                            accumulated_applied.clear();
+                            accumulated_ending_epoch =
+                                Some(boundary.ending_state().number);
+                        }
                         accumulated_applied.extend(boundary.applied_rewards.iter().cloned());
                     }
                 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/epoch_pots/main.rs` around lines 500 - 513, Collapse the two separate
`if let Some(boundary)` checks into a single binding for `boundary` when
matching `CardanoWorkUnit::AShard(shard)`: inside one `if let Some(boundary) =
shard.boundary()` block first check `if shard.shard_index() == 0` to clear
`accumulated_applied` and set `accumulated_ending_epoch =
Some(boundary.ending_state().number)`, then unconditionally call
`accumulated_applied.extend(boundary.applied_rewards.iter().cloned())`; this
removes the duplicated boundary lookup and makes the reset-and-extend logic
contiguous.
crates/cardano/work_units.md (1)

5-11: Add a language to the fenced block (MD040).

markdownlint-cli2 flagged this fence as missing a language. Use text (or any non-rendered identifier) so the diagram passes lint.

📝 Suggested fix
-```
+```text
 Estart  →  Roll …  →  Rupd  →  Roll …  →  AShard ×N  →  Ewrap
 (open)     (blocks)   (RUPD)   (blocks)    (per-account)         (global + close)
                                                                           │
                                                                           ▼
                                                                   next epoch's Estart
🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @crates/cardano/work_units.md around lines 5 - 11, The fenced code block
containing the epoch diagram starting with "Estart → Roll … → Rupd → Roll
… → AShard ×N → Ewrap" is missing a language tag (MD040); fix it by adding a
language identifier (use "text") immediately after the opening triple backticks
so the block becomes ```text and the diagram passes markdownlint-cli2
validation.


crates/cardano/src/lib.rs (2)

284-341: Consolidate duplicated `effective_account_shards` lookup logic.

The inline computation at lines 338–341 is identical to the new helper `read_effective_account_shards` (lines 253–258). On top of that, `load_epoch::<D>(state)` is invoked twice within `initialize` (line 292 for the warnings, line 338 to derive `effective_account_shards`). A small refactor avoids the double read and keeps the “persisted total else config” rule in one place.



♻️ Proposed refactor

Lift the helper to a free fn and reuse it (and reuse the loaded `EpochState` for the warnings):

```diff
-    /// Compute the effective `account_shards` value: stored
-    /// `ashard_progress.total` if a boundary is in flight, otherwise the
-    /// configured value.
-    fn read_effective_account_shards<D: Domain>(&self, state: &D::State) -> u32 {
-        load_epoch::<D>(state)
-            .ok()
-            .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
-            .unwrap_or_else(|| self.config.account_shards())
-    }
+    /// Compute the effective `account_shards` value: stored
+    /// `ashard_progress.total` if a boundary is in flight, otherwise the
+    /// configured value.
+    fn read_effective_account_shards<D: Domain>(&self, state: &D::State) -> u32 {
+        effective_account_shards_from_state::<D>(state, &self.config)
+    }
+}
+
+fn effective_account_shards_from_state<D: Domain>(state: &D::State, config: &CardanoConfig) -> u32 {
+    load_epoch::<D>(state)
+        .ok()
+        .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
+        .unwrap_or_else(|| config.account_shards())
 }
```

And in `initialize`, compute once and reuse the loaded epoch for both the warnings and the cached value.
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/lib.rs` around lines 284 - 341, Extract and reuse the
epoch load and the effective-account-shards logic: in initialize(), call
load_epoch::<D>(state) once and store the resulting EpochState (if Ok) for the
warnings; replace the inline computation of effective_account_shards with a call
to the existing helper read_effective_account_shards (or lift that helper to a
free fn if necessary) so the "persisted total else config" rule is implemented
in one place; update references to use the single loaded epoch value for the
tracing::warn! blocks and derive effective_account_shards from
read_effective_account_shards(state, &config) (or equivalent) to remove the
duplicate load_epoch::<D>(state) call.
```

</details>

---

`404-408`: **Per-`pop_work` state read may be unnecessary outside boundaries.**

`effective_account_shards` is refreshed on every `pop_work` call (each block cycle) via a `read_entity_typed` lookup, but it only matters at epoch boundaries. The cost is small (single point read), but you could skip the read when the cached value is non-stale by only refreshing when `ashard_progress` could have changed (i.e., before an actual boundary handoff). Worth considering if `pop_work` shows up in profiling.
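To make the suggestion concrete, here is a minimal self-contained sketch of the cached-marker idea; `ShardCache`, its fields, and `refresh` are illustrative names, not the crate's API:

```rust
// Hypothetical sketch: skip the state read unless the persisted marker
// (mirroring EpochState.ashard_progress.total) actually changed.
struct ShardCache {
    effective: u32,                // cached effective_account_shards
    last_seen_total: Option<u32>,  // marker from the last refresh
}

impl ShardCache {
    /// Refresh only when the persisted total differs from the cached marker;
    /// `None` means no boundary in flight, so fall back to the config value.
    fn refresh(&mut self, persisted_total: Option<u32>, config_default: u32) {
        if persisted_total != self.last_seen_total {
            self.effective = persisted_total.unwrap_or(config_default);
            self.last_seen_total = persisted_total;
        }
    }
}
```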

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/lib.rs` around lines 404 - 408, The code refreshes
self.effective_account_shards on every pop_work by calling
read_effective_account_shards::<D>(domain.state()), but this value only matters
when the epoch boundary handoff (EpochState.ashard_progress.total) can change;
avoid the repeated read by first checking whether ashard_progress has changed
and only calling read_effective_account_shards when it has. Add a small cached
marker on the struct (e.g., last_ashard_total or last_ashard_progress) and in
pop_work compare domain.state().ashard_progress.total (or the appropriate
accessor) to that cached value; if different, call read_effective_account_shards
and update both effective_account_shards and the cached marker, otherwise skip
the read and keep the existing effective_account_shards.
```

</details>

</blockquote></details>
<details>
<summary>crates/cardano/src/ewrap/commit.rs (1)</summary><blockquote>

`91-100`: **Full-table account scan to apply a handful of MIR rewards.**

`stream_and_apply_namespace::<D, AccountState>(state, &writer, None)` walks every account in state to surface the small set with MIR-queued deltas. The inline comment acknowledges this is "effectively a targeted write via the streaming path", but on mainnet-sized account tables this is a significant cost paid every boundary, regardless of how few MIRs are pending (often zero).

Consider a direct path: iterate `self.applied_mir_credentials` (or the credentials of the queued `AssignRewards` deltas), point-read each `AccountState`, apply, and write. The streaming helper can stay for the full-set namespaces (Pool/DRep/Proposal/EpochState), but accounts during Ewrap are inherently sparse.
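A toy model of the targeted path, with a `HashMap` standing in for the account table and string credentials standing in for the real credential types (names are illustrative, not the crate's API):

```rust
use std::collections::HashMap;

// Apply queued MIR deltas by point-reading only the affected accounts,
// instead of streaming the whole table to find the few with pending rewards.
fn apply_mir_targeted(
    accounts: &mut HashMap<String, u64>, // stand-in for the AccountState table
    queued: &[(String, u64)],            // (credential, reward) deltas
) -> usize {
    let mut touched = 0;
    for (cred, amount) in queued {
        if let Some(balance) = accounts.get_mut(cred) { // point read, not a scan
            *balance += amount;
            touched += 1;
        }
    }
    touched
}
```

The cost scales with the number of queued MIR deltas (often zero at a boundary) rather than with the total account count.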

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ewrap/commit.rs` around lines 91 - 100, The current call
to stream_and_apply_namespace::<D, AccountState>(state, &writer, None) performs
a full-table scan of AccountState each boundary; instead implement a targeted
path that iterates over self.applied_mir_credentials (or the credentials from
queued AssignRewards deltas), point-reads each AccountState, applies the MIR
delta, and writes the result via the same writer so only affected accounts are
touched; keep stream_and_apply_namespace for full-set namespaces like
Pool/DRep/Proposal/EpochState, and ensure the new code reuses the existing
apply/serialize logic used by stream_and_apply_namespace to maintain invariants
and error handling.
```

</details>

</blockquote></details>
<details>
<summary>crates/cardano/src/ashard/work_unit.rs (1)</summary><blockquote>

`62-129`: **Avoid recomputing `total_shards`/`range` from state in `commit_state`.**

`load` computes `total_shards` and `range` from persisted state, then calls `BoundaryWork::load_ashard(..., shard_index, total_shards, range)` — so the boundary already owns these values. `commit_state` repeats the same state read and recomputation, with two minor concerns:

1. **Code duplication**: identical 8-line block in both methods (and a third copy in `crates/cardano/src/lib.rs`).
2. **Subtle drift risk**: if `commit_state` ever observed a different `ashard_progress.total` than `load` did (e.g., a shard's apply landing between the two phases of the same work unit), the recomputed `range` would diverge from what was loaded. Today this can't happen, but cross-phase reuse of the captured values is more obviously correct.

Consider exposing the captured shard info from `BoundaryWork` (or storing the computed `range` in `AShardWorkUnit` after `load`) and have `commit_state` reuse it instead of re-deriving.



<details>
<summary>♻️ Sketch</summary>

```diff
 pub struct AShardWorkUnit {
     slot: BlockSlot,
     config: CardanoConfig,
     genesis: Arc<Genesis>,
     shard_index: u32,
     boundary: Option<BoundaryWork>,
+    /// Captured during `load` so `commit_state` reuses the exact same range.
+    range: Option<std::ops::Range<dolos_core::EntityKey>>,
 }
```

Then `load` stores `self.range = Some(range.clone())` and `commit_state` consumes it.
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ashard/work_unit.rs` around lines 62 - 129, The
load/commit_state duplication and drift risk can be fixed by capturing the
computed total_shards and range during load and reusing them in commit_state
instead of re-reading state: after computing total_shards and range in load
(before calling BoundaryWork::load_ashard), store the values on the work unit
(e.g., add fields self.total_shards and/or self.range and assign them there),
and then change commit_state to take the boundary via self.boundary.as_mut() and
use the stored range/total_shards when calling boundary.commit_ashard::<D>(...)
rather than recomputing via load_epoch; update BoundaryWork usage only to
consume the already-stored range if needed.
```

</details>

</blockquote></details>
<details>
<summary>crates/cardano/src/work.rs (2)</summary><blockquote>

`276-279`: **Test `TEST_TOTAL_SHARDS = 1` doesn't cover multi-shard sequencing in this state machine.**

These tests collapse `AShard` into the `EWrap` tag and set `TEST_TOTAL_SHARDS = 1`, so the multi-shard loop in `pop_work` (lines 226–245) — including the `next_index >= total_shards` transition — is exercised only with `total_shards = 1`. Worth adding at least one test with `TEST_TOTAL_SHARDS > 1` that asserts the work buffer emits exactly N `AShard` units before transitioning to `EwrapBoundary` (and that `last_point_seen` / cursor invariants hold across that transition). This catches off-by-one and ordering regressions in the boundary state machine that the current tests would silently miss.



Want me to draft this test?


Also applies to: 325-330
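The shape such a test could assert, modeled here with a toy sequencer (the names `Unit` and `boundary_sequence` are hypothetical; the real test would drive the crate's work buffer via `pop_work`):

```rust
// N consecutive AShard units, indexed 0..N, followed by exactly one Ewrap.
#[derive(Debug, PartialEq)]
enum Unit {
    AShard(u32),
    Ewrap,
}

fn boundary_sequence(total_shards: u32) -> Vec<Unit> {
    assert!(total_shards >= 1, "validate_total_shards rejects 0");
    let mut out: Vec<Unit> = (0..total_shards).map(Unit::AShard).collect();
    out.push(Unit::Ewrap);
    out
}
```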

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/work.rs` around lines 276 - 279, Add a new unit test that
sets TEST_TOTAL_SHARDS > 1 (instead of the current const TEST_TOTAL_SHARDS = 1)
to exercise the multi-shard sequencing in the pop_work logic: drive the
WorkBuffer/state machine (invoking pop_work) and assert that it emits exactly N
consecutive AShard units (N == TEST_TOTAL_SHARDS) before producing an
EwrapBoundary (or EWrap/Ewrap as used in this code), and verify that
last_point_seen and any cursor invariants are preserved across the transition;
target the test to exercise the pop_work path that contains the next_index >=
total_shards branch and use the existing helpers/setup used by the other tests
so it integrates with the same state machine and coverage.
```

</details>

---

`120-143`: **`account_shards == 0` "skip-shards" fallback contradicts `validate_total_shards`.**

`crates/cardano/src/ashard/shard.rs::validate_total_shards` rejects `0` outright ("account_shards must be >= 1"), yet both `on_ewrap_boundary` and the `PreEwrapBoundary` arm of `pop_work` treat `0` as "no shards — go straight to Ewrap". If a config bug or missing validation ever hands `0` here, you'll silently skip per-account boundary processing instead of failing fast. The defensive comment ("Shouldn't happen with a valid config") tacitly acknowledges this.

Either drop the branches and let an `unreachable!()` / `assert!(account_shards >= 1)` fail loudly, or change `validate_total_shards` to accept `0` as "no shards" if that's the intended semantics. Right now the two pieces disagree.




Also applies to: 210-225

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/work.rs` around lines 120 - 143, The code currently treats
account_shards == 0 as a valid "no shards" path in on_ewrap_boundary (and
similarly in the PreEwrapBoundary arm of pop_work), which contradicts
validate_total_shards that requires account_shards >= 1; change the behavior to
fail-fast: remove the special-case branch that returns EwrapBoundary when
account_shards == 0 and instead assert or panic if account_shards < 1 (e.g., add
assert!(account_shards >= 1) at the start of on_ewrap_boundary), keeping the
normal Restart -> AShardingBoundary path for valid shard counts; alternatively,
if the intended semantics are that 0 means "no shards", update
validate_total_shards to accept 0 and document that invariant—pick one
consistent approach and apply the same fix in the PreEwrapBoundary/pop_work code
paths so both agree with validate_total_shards.
```

</details>

</blockquote></details>
<details>
<summary>crates/cardano/src/ashard/shard.rs (1)</summary><blockquote>

`37-60`: **`debug_assert!` makes invalid `total_shards` a runtime div-by-zero in release.**

The contract is documented and there's a `validate_total_shards` helper, but the only enforcement inside `shard_key_range` is `debug_assert!`. In release builds, `total_shards == 0` reaches `let step = PREFIX_SPACE / total_shards;` and panics with a less informative message; non-divisor values silently produce truncated `step` arithmetic.

Since this is a public function and the cost of a real assertion (or returning `Result`) is negligible compared to the I/O the caller does, prefer `assert!` (or propagate the error from `validate_total_shards`).
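For reference, a self-contained model of the divides-256 invariant with hard asserts; `u32` prefixes stand in for `EntityKey`, and the function bodies are a sketch of the contract rather than the crate's implementation:

```rust
use std::ops::Range;

const PREFIX_SPACE: u32 = 256; // one-byte key prefix space

// Reject 0 (division by zero) and non-divisors of 256 (truncated ranges
// would leave key prefixes uncovered).
fn validate_total_shards(total: u32) -> Result<(), String> {
    if total == 0 || PREFIX_SPACE % total != 0 {
        return Err(format!(
            "account_shards must be >= 1 and divide {PREFIX_SPACE}, got {total}"
        ));
    }
    Ok(())
}

fn shard_key_range(shard_index: u32, total_shards: u32) -> Range<u32> {
    // Real asserts: enforced in release builds too, unlike debug_assert!.
    assert!(
        validate_total_shards(total_shards).is_ok(),
        "invalid total_shards: {total_shards}"
    );
    assert!(shard_index < total_shards);
    let step = PREFIX_SPACE / total_shards;
    (shard_index * step)..((shard_index + 1) * step)
}
```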



<details>
<summary>🛡️ Proposed change</summary>

```diff
 pub fn shard_key_range(shard_index: u32, total_shards: u32) -> Range<EntityKey> {
-    debug_assert!(validate_total_shards(total_shards).is_ok());
-    debug_assert!(shard_index < total_shards);
+    assert!(
+        validate_total_shards(total_shards).is_ok(),
+        "invalid total_shards: {total_shards}"
+    );
+    assert!(shard_index < total_shards, "shard_index {shard_index} >= total_shards {total_shards}");
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ashard/shard.rs` around lines 37 - 60, The function
shard_key_range uses debug_assert! to check validate_total_shards(total_shards)
and shard_index < total_shards, which leaves a division-by-zero and silent
truncation in release; change the debug_assert! checks to real runtime checks
(e.g., assert!(validate_total_shards(total_shards).is_ok()) and
assert!(shard_index < total_shards)) or convert the function to return Result
and propagate validate_total_shards() error from shard_key_range so PREFIX_SPACE
/ total_shards cannot divide by zero and invalid total_shards are handled
deterministically; update callers if you choose the Result approach and keep
references to shard_key_range, validate_total_shards, and PREFIX_SPACE when
making the change.
```

</details>

</blockquote></details>
<details>
<summary>crates/cardano/src/ewrap/loading.rs (1)</summary><blockquote>

`280-285`: **`shard_applied_*` fields are dead state on Ewrap-constructed `BoundaryWork`.**

`new_empty` zero-initializes `shard_applied_effective`, `shard_applied_unspendable_to_treasury`, and `shard_applied_unspendable_to_reserves`, but Ewrap never reads/writes them — the per-shard accumulators land in `EpochState.end` via `EpochEndAccumulate` and are read from `ending_state.end` by `wrapup.flush`. The fields exist only because `BoundaryWork` is shared with the AShard path.

Not a bug, but the dual-use API obscures intent. If a future change adds a third caller, it's easy to misuse these fields. Consider either splitting `BoundaryWork` into `AShardWork`/`EwrapWork` or documenting on the struct which fields each phase owns.

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/ewrap/loading.rs` around lines 280 - 285, BoundaryWork's
shard_applied_effective / shard_applied_unspendable_to_treasury /
shard_applied_unspendable_to_reserves are dead for the Ewrap path (zeroed in
new_empty and never read by Ewrap; real per-shard accumulators live in
EpochState.end via EpochEndAccumulate and are consumed by wrapup.flush), so fix
by either splitting the dual-purpose struct into two clear types (e.g.,
AShardWork and EwrapWork) and move the AShard-only shard_applied_* fields into
AShardWork, or add explicit documentation/comments on BoundaryWork and those
three fields noting they are owned/used only by the AShard path; update
new_empty, any constructors, and usages of BoundaryWork (including references in
EpochEndAccumulate and wrapup.flush) to use the new types or to reflect the
documented ownership to prevent future misuse.
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @crates/cardano/src/model/epochs.rs:

  • Around line 737-740: The apply path currently does
    entity.end.as_mut().expect("ESTART seeded EpochState.end before shards run")
    which panics if EpochState.end is None; instead ensure end is initialized to
    EndStats::default() when missing (so subsequent EpochWrapUp can overwrite it).
    Update the apply code in crates/cardano/src/model/epochs.rs to replace the
    expect with a safe initialization (e.g., use get_or_insert_with / get_or_insert
    or similar) so entity.end becomes Some(EndStats::default()) if it was None
    before mutating it.
  • Around line 684-771: EpochEndAccumulate::undo currently assumes apply always
    mutated state, causing underflow/corruption when apply returned early; fix by
    recording in apply whether it performed mutations and the prior values needed to
    revert (capture previous end.effective_rewards, end.unspendable_to_treasury,
    end.unspendable_to_reserves and previous entity.ashard_progress/AShardProgress)
    into fields on EpochEndAccumulate (or a was_applied flag plus prev_* fields),
    then make undo a no-op if apply was skipped or restore those captured prev_*
    values and previous ashard_progress instead of unconditionally subtracting and
    setting committed/total; update apply (function EpochEndAccumulate::apply) to
    populate these fields only when it actually mutates state and add a proptest
    exercising apply→undo when ashard_progress is already advanced past
    completed_shard_index to lock the behavior.

In @crates/core/src/config.rs:

  • Around line 988-1002: Add runtime validation to reject invalid account_shards
    values by calling the existing validate_total_shards() from
    crates/cardano/src/ashard/shard.rs during config initialization: when
    constructing/using CardanoConfig (e.g., inside CardanoLogic::initialize() before
    any use of CardanoConfig::account_shards() or DEFAULT_ACCOUNT_SHARDS), call
    validate_total_shards(self.account_shards.unwrap_or(Self::DEFAULT_ACCOUNT_SHARDS))
    and return/propagate an error if it fails; ensure the initialization path stops
    the startup with a clear error rather than allowing invalid values (0 or
    non-divisors of 256) to be used.

In @skills/debug-epoch-mismatch/SKILL.md:

  • Line 138: The documentation still treats reward application as a single EWRAP
    work unit but the code split moves per-account reward application into AShard
    while Ewrap now handles MIR/global closure; update the tables that list or
    reference EWRAP (notably the "Classify the Mismatch", "Dolos Work Units →
    Haskell Concepts", and the instrumentation hypothesis table) to distinguish
    AShard vs Ewrap: point applyRUpd + reward filtering, unspendable routing and
    registration filtering to AShard and point MIR and boundary close to Ewrap, and
    adjust any bisection/debugging guidance so it routes to AShard for per-account
    reward issues and to Ewrap for MIR/global closure issues. Ensure references
    include the file/class names AShard, Ewrap and the function/concept names
    applyRUpd, MIR, reward filtering, boundary close to make locating code
    straightforward.

Outside diff comments:
In @crates/cardano/src/ashard/rewards.rs:

  • Around line 114-188: The per-shard logs in flush don't indicate which shard
    they belong to; update the flush logging to include shard context (shard_index
    and total_shards). Add shard_index: u32 and total_shards: u32 as fields on
    the AShard struct (or otherwise make them available on self or ctx), ensure any
    AShard construction/population passes these values, and then include
    %shard_index and %total_shards in the tracing::debug! and tracing::error!
    invocations (and the debug! at the end) around ctx.rewards/ drain_unspendable/
    applied_reward_credentials so all per-shard telemetry shows shard scope.

In @crates/cardano/src/model/epochs.rs:

  • Around line 980-981: The rustdoc warning comes from using a /// doc comment
    immediately above the prop_compose! macro (related to EpochStatsUpdate::apply
    and rolling.live_mut which asserts rolling.next is None), so replace that
    leading /// comment with a non-doc comment (//) or move the explanation into the
    generator body/closure so rustdoc won't try to document the macro invocation;
    specifically update the comment above the prop_compose! invocation that
    references rolling.live_mut and rolling.next to use // (or relocate it inside
    the prop_compose! closure) to silence the unused_doc_comments lint.

Nitpick comments:
In @crates/cardano/src/ashard/loading.rs:

  • Around line 19-57: The function load_pending_rewards_range currently accepts
    Option<Range<EntityKey>> but the only caller (load_ashard) always passes
    Some(range), so tighten the signature to take Range<EntityKey> directly:
    change the parameter type in load_pending_rewards_range and update callers
    (e.g., in load_ashard remove Some(range.clone()) and pass range.clone()
    directly); ensure the body drops any pattern-matching for None (no-op) and
    uses the range value when calling
    state.iter_entities_typed::<PendingRewardState>(PendingRewardState::NS, range).

In @crates/cardano/src/ashard/rewards.rs:

  • Around line 56-77: Update the stale comment referencing "EWRAP" to reflect the
    new pipeline stage: change the text around the account deregistration
    explanation to say "deregistered before AShard" (or "before the boundary's
    per-account leg") so it matches where the per-account registration check now
    runs; the relevant area to edit is the comment above the account.is_registered()
    branch in the function that calls reward.total_value() and
    ctx.rewards.return_reward_to_treasury(total). Ensure the updated comment keeps
    the existing clarification about pre-filtered accounts and returned_rewards
    behavior.

In @crates/cardano/src/ashard/shard.rs:

  • Around line 37-60: The function shard_key_range uses debug_assert! to check
    validate_total_shards(total_shards) and shard_index < total_shards, which leaves
    a division-by-zero and silent truncation in release; change the debug_assert!
    checks to real runtime checks (e.g.,
    assert!(validate_total_shards(total_shards).is_ok()) and assert!(shard_index <
    total_shards)) or convert the function to return Result and propagate
    validate_total_shards() error from shard_key_range so PREFIX_SPACE /
    total_shards cannot divide by zero and invalid total_shards are handled
    deterministically; update callers if you choose the Result approach and keep
    references to shard_key_range, validate_total_shards, and PREFIX_SPACE when
    making the change.

In @crates/cardano/src/ashard/work_unit.rs:

  • Around line 62-129: The load/commit_state duplication and drift risk can be
    fixed by capturing the computed total_shards and range during load and reusing
    them in commit_state instead of re-reading state: after computing total_shards
    and range in load (before calling BoundaryWork::load_ashard), store the values
    on the work unit (e.g., add fields self.total_shards and/or self.range and
    assign them there), and then change commit_state to take the boundary via
    self.boundary.as_mut() and use the stored range/total_shards when calling
    boundary.commit_ashard::<D>(...) rather than recomputing via load_epoch; update
    BoundaryWork usage only to consume the already-stored range if needed.

In @crates/cardano/src/ewrap/commit.rs:

  • Around line 91-100: The current call to stream_and_apply_namespace::<D,
    AccountState>(state, &writer, None) performs a full-table scan of AccountState
    each boundary; instead implement a targeted path that iterates over
    self.applied_mir_credentials (or the credentials from queued AssignRewards
    deltas), point-reads each AccountState, applies the MIR delta, and writes the
    result via the same writer so only affected accounts are touched; keep
    stream_and_apply_namespace for full-set namespaces like
    Pool/DRep/Proposal/EpochState, and ensure the new code reuses the existing
    apply/serialize logic used by stream_and_apply_namespace to maintain invariants
    and error handling.

In @crates/cardano/src/ewrap/loading.rs:

  • Around line 280-285: BoundaryWork's shard_applied_effective /
    shard_applied_unspendable_to_treasury / shard_applied_unspendable_to_reserves
    are dead for the Ewrap path (zeroed in new_empty and never read by Ewrap; real
    per-shard accumulators live in EpochState.end via EpochEndAccumulate and are
    consumed by wrapup.flush), so fix by either splitting the dual-purpose struct
    into two clear types (e.g., AShardWork and EwrapWork) and move the AShard-only
    shard_applied_* fields into AShardWork, or add explicit documentation/comments
    on BoundaryWork and those three fields noting they are owned/used only by the
    AShard path; update new_empty, any constructors, and usages of BoundaryWork
    (including references in EpochEndAccumulate and wrapup.flush) to use the new
    types or to reflect the documented ownership to prevent future misuse.

In @crates/cardano/src/lib.rs:

  • Around line 284-341: Extract and reuse the epoch load and the
    effective-account-shards logic: in initialize(), call load_epoch::<D>(state)
    once and store the resulting EpochState (if Ok) for the warnings; replace the
    inline computation of effective_account_shards with a call to the existing
    helper read_effective_account_shards (or lift that helper to a free fn if
    necessary) so the "persisted total else config" rule is implemented in one
    place; update references to use the single loaded epoch value for the
    tracing::warn! blocks and derive effective_account_shards from
    read_effective_account_shards(state, &config) (or equivalent) to remove the
    duplicate load_epoch::<D>(state) call.
  • Around line 404-408: The code refreshes self.effective_account_shards on every
    pop_work by calling read_effective_account_shards::<D>(domain.state()), but this
    value only matters when the epoch boundary handoff
    (EpochState.ashard_progress.total) can change; avoid the repeated read by first
    checking whether ashard_progress has changed and only calling
    read_effective_account_shards when it has. Add a small cached marker on the
    struct (e.g., last_ashard_total or last_ashard_progress) and in pop_work compare
    domain.state().ashard_progress.total (or the appropriate accessor) to that
    cached value; if different, call read_effective_account_shards and update both
    effective_account_shards and the cached marker, otherwise skip the read and keep
    the existing effective_account_shards.

In @crates/cardano/src/work.rs:

  • Around line 276-279: Add a new unit test that sets TEST_TOTAL_SHARDS > 1
    (instead of the current const TEST_TOTAL_SHARDS = 1) to exercise the multi-shard
    sequencing in the pop_work logic: drive the WorkBuffer/state machine (invoking
    pop_work) and assert that it emits exactly N consecutive AShard units (N ==
    TEST_TOTAL_SHARDS) before producing an EwrapBoundary (or EWrap/Ewrap as used in
    this code), and verify that last_point_seen and any cursor invariants are
    preserved across the transition; target the test to exercise the pop_work path
    that contains the next_index >= total_shards branch and use the existing
    helpers/setup used by the other tests so it integrates with the same state
    machine and coverage.
  • Around line 120-143: The code currently treats account_shards == 0 as a valid
    "no shards" path in on_ewrap_boundary (and similarly in the PreEwrapBoundary arm
    of pop_work), which contradicts validate_total_shards that requires
    account_shards >= 1; change the behavior to fail-fast: remove the special-case
    branch that returns EwrapBoundary when account_shards == 0 and instead assert or
    panic if account_shards < 1 (e.g., add assert!(account_shards >= 1) at the start
    of on_ewrap_boundary), keeping the normal Restart -> AShardingBoundary path for
    valid shard counts; alternatively, if the intended semantics are that 0 means
    "no shards", update validate_total_shards to accept 0 and document that
    invariant—pick one consistent approach and apply the same fix in the
    PreEwrapBoundary/pop_work code paths so both agree with validate_total_shards.

In @crates/cardano/work_units.md:

  • Around line 5-11: The fenced code block containing the epoch diagram starting
    with "Estart → Roll … → Rupd → Roll … → AShard ×N → Ewrap" is missing
    a language tag (MD040); fix it by adding a language identifier (use "text")
    immediately after the opening triple backticks so the block becomes ```text and
    the diagram passes markdownlint-cli2 validation.

In @tests/epoch_pots/main.rs:

  • Around line 500-513: Collapse the two separate if let Some(boundary) checks
    into a single binding for boundary when matching
    CardanoWorkUnit::AShard(shard): inside one if let Some(boundary) = shard.boundary() block, first check if shard.shard_index() == 0 to clear
    accumulated_applied and set accumulated_ending_epoch = Some(boundary.ending_state().number), then unconditionally call
    accumulated_applied.extend(boundary.applied_rewards.iter().cloned()); this
    removes the duplicated boundary lookup and makes the reset-and-extend logic
    contiguous.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] Push a commit to this branch (recommended)
- [ ] Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: defaults

**Review profile**: CHILL

**Plan**: Pro

**Run ID**: `13b48d06-fb3d-459b-b0a0-59ac2b808397`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 885094deed721abaa8bca14087703bd7ea409c28 and 8174d2a832bd81baa5228d8b3059b0b2e0bfd7ba.

</details>

<details>
<summary>📒 Files selected for processing (23)</summary>

* `crates/cardano/src/ashard/commit.rs`
* `crates/cardano/src/ashard/loading.rs`
* `crates/cardano/src/ashard/mod.rs`
* `crates/cardano/src/ashard/rewards.rs`
* `crates/cardano/src/ashard/shard.rs`
* `crates/cardano/src/ashard/work_unit.rs`
* `crates/cardano/src/ewrap/commit.rs`
* `crates/cardano/src/ewrap/loading.rs`
* `crates/cardano/src/ewrap/mod.rs`
* `crates/cardano/src/ewrap/work_unit.rs`
* `crates/cardano/src/ewrap/wrapup.rs`
* `crates/cardano/src/genesis/mod.rs`
* `crates/cardano/src/lib.rs`
* `crates/cardano/src/model/epochs.rs`
* `crates/cardano/src/model/mod.rs`
* `crates/cardano/src/model/pending.rs`
* `crates/cardano/src/rupd/work_unit.rs`
* `crates/cardano/src/work.rs`
* `crates/cardano/work_units.md`
* `crates/core/src/config.rs`
* `skills/debug-epoch-mismatch/SKILL.md`
* `tests/epoch_pots/main.rs`
* `tests/memory.rs`

</details>

</details>


Comment thread crates/cardano/src/model/epochs.rs
Comment on lines +737 to +740
```rust
let end = entity
    .end
    .as_mut()
    .expect("ESTART seeded EpochState.end before shards run");
```

⚠️ Potential issue | 🟡 Minor

apply panics if EpochState.end is None — fragile coupling to Genesis seeding.

The expect("ESTART seeded EpochState.end before shards run") relies on Genesis seeding end = Some(EndStats::default()) for the first boundary, and on EpochTransition re-seeding it on every subsequent transition. This is correct on the happy path but turns any future code path that constructs an EpochState with end = None (test fixtures, migration tools, recovery utilities, a genesis bootstrap regression) into a hard panic deep inside delta application.

Cheaper to defend: initialize end to EndStats::default() if absent rather than panic. The EpochWrapUp delta later overwrites it with the final stats, so semantics are preserved.

🛡️ Suggested change

```diff
-        let end = entity
-            .end
-            .as_mut()
-            .expect("ESTART seeded EpochState.end before shards run");
+        let end = entity.end.get_or_insert_with(EndStats::default);
```
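For reference, `Option::get_or_insert_with` gives exactly the "seed if absent" semantics the suggestion relies on. A self-contained demo, with `EndStats` reduced to a single stand-in field:

```rust
#[derive(Debug, Default, PartialEq)]
struct EndStats {
    effective_rewards: u64,
}

/// Seed `end` with the default if it is still None, then mutate it.
/// Never panics, unlike `.as_mut().expect(...)`.
fn accumulate(end: &mut Option<EndStats>, delta: u64) {
    let stats = end.get_or_insert_with(EndStats::default);
    stats.effective_rewards += delta;
}
```

A later `EpochWrapUp` overwriting the whole `end` slot is unaffected either way, so the defensive initialization is semantically free.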
🤖 Prompt for AI Agents

```
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/model/epochs.rs` around lines 737 - 740, The apply path
currently does entity.end.as_mut().expect("ESTART seeded EpochState.end before
shards run") which panics if EpochState.end is None; instead ensure end is
initialized to EndStats::default() when missing (so subsequent EpochWrapUp can
overwrite it). Update the apply code in crates/cardano/src/model/epochs.rs to
replace the expect with a safe initialization (e.g., use get_or_insert_with /
get_or_insert or similar) so entity.end becomes Some(EndStats::default()) if it
was None before mutating it.
```

scarmuega and others added 3 commits April 26, 2026 12:29
apply has three early-return guards (idempotent repeat, out-of-order,
total_shards mismatch) that leave state untouched. undo unconditionally
subtracted the deltas and overwrote ashard_progress, so a rollback
following a skipped apply would underflow the u64 end.* fields and
clobber the cursor.

Capture prev_ashard_progress and set an applied flag during apply only
when state is actually mutated; undo early-returns when !applied and
restores from the snapshot. Same pattern as EpochWrapUp/EpochStatsUpdate.

Also broaden any_epoch_state to vary ashard_progress so the existing
roundtrip proptests for EpochWrapUp and EpochTransition exercise the
Some(_) → None → Some(_) path their apply/undo introduced earlier in
this branch (previously only None → None was covered). Add a dedicated
epoch_end_accumulate_roundtrip proptest covering all four progress
shapes and all three skip branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The divides-256 invariant on account_shards is enforced only via
debug_assert! in shard_key_range(), which is stripped in release
builds. An invalid TOML value (0, 3, 7, 100, ...) would deserialize
cleanly and silently corrupt key-range coverage.

Call validate_total_shards() at the top of CardanoLogic::initialize
and surface failures as ChainError::InvalidConfig so misconfiguration
fails the startup with a clear message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After this branch's restructure, per-account reward application,
unspendable routing, and EWRAP-time registration filtering live in
AShard; only MIRs, refunds, and boundary close remain in Ewrap.
Update the Classify, Work Units, Source Files, and Instrumentation
tables so the bisection workflow points at the right work unit
(and module path) for each failure shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
crates/cardano/src/lib.rs (1)

250-258: Minor: deduplicate effective-shard-count computation.

The body of read_effective_account_shards (lines 253-258) is identical to the inline block at lines 345-348 in initialize. Since Self isn't yet constructed there, the method form can't be reused, but extracting a free helper (or Self::compute_effective_account_shards<D>(state, &config)) would keep both call sites in sync if the fallback rule ever changes.

♻️ Suggested refactor
```diff
-    /// Compute the effective `account_shards` value: stored
-    /// `ashard_progress.total` if a boundary is in flight, otherwise the
-    /// configured value.
-    fn read_effective_account_shards<D: Domain>(&self, state: &D::State) -> u32 {
-        load_epoch::<D>(state)
-            .ok()
-            .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
-            .unwrap_or_else(|| self.config.account_shards())
-    }
+    /// Compute the effective `account_shards` value: stored
+    /// `ashard_progress.total` if a boundary is in flight, otherwise the
+    /// configured value.
+    fn compute_effective_account_shards<D: Domain>(state: &D::State, config: &CardanoConfig) -> u32 {
+        load_epoch::<D>(state)
+            .ok()
+            .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
+            .unwrap_or_else(|| config.account_shards())
+    }
+
+    fn read_effective_account_shards<D: Domain>(&self, state: &D::State) -> u32 {
+        Self::compute_effective_account_shards::<D>(state, &self.config)
+    }
```

Then at lines 345-348:

```diff
-        let effective_account_shards = load_epoch::<D>(state)
-            .ok()
-            .and_then(|e| e.ashard_progress.as_ref().map(|p| p.total))
-            .unwrap_or_else(|| config.account_shards());
+        let effective_account_shards =
+            Self::compute_effective_account_shards::<D>(state, &config);
```

Also applies to: 343-348

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/cardano/src/lib.rs` around lines 250 - 258, The effective-shard-count
computation is duplicated between the method read_effective_account_shards and
the inline block in initialize; extract the logic into a single helper (either a
free function like compute_effective_account_shards::<D>(state, &config) or an
associated fn Self::compute_effective_account_shards::<D>(state, &config)) that
returns the u32 by loading the epoch and falling back to
config.account_shards(), then replace both the body of
read_effective_account_shards and the inline compute in initialize to call that
helper so both call sites stay in sync if the fallback changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3543b5af-c9cb-43d0-942e-a2c733a526c3

📥 Commits

Reviewing files that changed from the base of the PR and between 8174d2a and aa835bb.

📒 Files selected for processing (4)
  • crates/cardano/src/lib.rs
  • crates/cardano/src/model/epochs.rs
  • crates/core/src/lib.rs
  • skills/debug-epoch-mismatch/SKILL.md
✅ Files skipped from review due to trivial changes (1)
  • crates/core/src/lib.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • skills/debug-epoch-mismatch/SKILL.md
